Living Semantic Web

Does the Semantic Web behave like a living system?


The New RDFCrawler is a modification of the existing RDFCrawler. The RDF API has been updated to Jena in order to cope with the greatest amount of RDF metadata available on the Web. Some other changes have also been introduced to improve its capabilities. They are summarised in the following points:

  • Memory management: the new version does not rely exclusively on an "in-memory" approach. Once a URL with RDF metadata has been loaded and parsed, its triples are serialised to a common file in N-Triples form. The memory can then be freed, and the process is applied iteratively, building the global RDF model in the disk file. This allows much bigger RDF models to be crawled.
  • DAML Ontology Library: the crawler has been extended and can now extract the starting URLs from an HTML file containing a list of them, for instance the DAML Ontology Library.
  • RDF to Pajek Net translator: there is a separate translator from input RDF models, serialised in RDF/XML or N-Triples form, to Pajek nets in ".net" format. For each RDF triple, the subject and the object become network nodes connected by a directed edge from subject to object.
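
The memory management point can be illustrated with a minimal Python sketch. It is not the crawler's Java/Jena code: the names append_triples, crawl_to_file and load_triples are hypothetical, the loader is a stand-in for fetching and parsing one URL, and the terms are assumed to be already written in N-Triples syntax. The only idea shown is that each document's triples are appended to one shared file and then released before the next document is loaded.

```python
import os
import tempfile


def append_triples(path, triples):
    """Append (subject, predicate, object) terms as N-Triples lines.

    The terms are assumed to be in N-Triples syntax already,
    e.g. "<http://example.org/a>" or '"a literal"'.
    Returns the number of lines written.
    """
    count = 0
    with open(path, "a", encoding="utf-8") as out:
        for s, p, o in triples:
            out.write(f"{s} {p} {o} .\n")
            count += 1
    return count


def crawl_to_file(sources, load_triples, path):
    """Process one source at a time so only its triples are held in memory.

    load_triples is a hypothetical callable standing in for
    "fetch this URL and parse its RDF".
    """
    total = 0
    for source in sources:
        triples = load_triples(source)
        total += append_triples(path, triples)
        del triples  # drop the reference; the global model lives on disk
    return total


# Usage with an in-memory stand-in for the Web (made-up data).
FAKE_WEB = {
    "http://example.org/a.rdf": [
        ("<http://example.org/a>", "<http://example.org/knows>", "<http://example.org/b>"),
    ],
    "http://example.org/b.rdf": [
        ("<http://example.org/b>", "<http://example.org/name>", '"Bob"'),
        ("<http://example.org/b>", "<http://example.org/knows>", "<http://example.org/a>"),
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.nt")
    written = crawl_to_file(list(FAKE_WEB), FAKE_WEB.get, model_path)
    with open(model_path, encoding="utf-8") as f:
        model_lines = f.read().splitlines()
```

Because N-Triples is line-based, appending to the file is enough to grow the global model; nothing already written has to be read back or rewritten.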

Installation

Download and unzip LivingSW.zip. It contains the source code (/src), the compiled code (/bin), a regular expressions package (/lib) and a pair of useful scripts. In addition, the New RDFCrawler requires some Jena libraries to be placed in /lib. It has been tested with those from Jena 1.6.1: jena.jar, icu4j.jar, xerces.jar, junit.jar and concurrent-1.3.0.jar.

Use

The functionality of the New RDFCrawler is packaged in the two provided scripts. The first one launches the crawler and has two forms. The first form crawls from the given URL for the given time and crawling depth. The second pre-processes the given HTML URL to extract the URLs from which the crawl will start.

> rdfcrawl URL [depth :int] [time :int]
> rdfcrawl base :htmlURL [depth :int] [time :int]
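
The pre-processing step of the second form can be pictured with the following Python sketch. How the crawler itself selects links from the HTML page is not described here, so this is only an assumption: it collects every href of an anchor tag, resolves it against the page URL and drops duplicates. LinkCollector and extract_start_urls are hypothetical names, and the sample page is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> links, in page order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                url = urljoin(self.base_url, value)  # resolve relative links
                if url not in self.links:            # keep the first occurrence only
                    self.links.append(url)


def extract_start_urls(html_text, base_url):
    """Return the candidate starting URLs found in an HTML listing page."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.links


# Made-up listing page, standing in for an ontology library index.
sample_page = """
<ul>
  <li><a href="http://example.org/onto/a.daml">A</a></li>
  <li><a href="b.daml">B</a></li>
  <li><a href="http://example.org/onto/a.daml">A again</a></li>
</ul>
"""
start_urls = extract_start_urls(sample_page, "http://example.org/lib/index.html")
```

Each URL in the resulting list would then be crawled as if it had been passed to the first form of the command.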

The other script converts the N-Triples RDF model produced by the crawler to a Pajek net. It can also convert RDF/XML serialisation files:

> nt2pajek rdfserialisationfile(.nt|.xml)
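
The mapping from triples to a Pajek net can be sketched in Python as follows. This is not the nt2pajek implementation: parse_ntriples_line and ntriples_to_pajek are hypothetical names, the parser only handles simple one-line statements, and replacing double quotes inside labels with single quotes is an assumption made here to keep the vertex labels well-formed. Following the description above, every subject and object (literals included) becomes a vertex, and each triple yields one arc from subject to object.

```python
def parse_ntriples_line(line):
    """Split one N-Triples statement into (subject, predicate, object).

    Returns None for blank lines and comments. Simplified: the subject and
    predicate contain no whitespace, so everything after them is the object.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    if not line.endswith("."):
        raise ValueError(f"missing final '.': {line!r}")
    parts = line[:-1].rstrip().split(None, 2)
    if len(parts) != 3:
        raise ValueError(f"not a triple: {line!r}")
    return tuple(parts)


def ntriples_to_pajek(lines):
    """Return the text of a Pajek .net file for the given N-Triples lines."""
    ids = {}   # term -> 1-based vertex number, in order of first appearance
    arcs = []  # (subject id, object id), one per triple

    def node_id(term):
        if term not in ids:
            ids[term] = len(ids) + 1
        return ids[term]

    for line in lines:
        triple = parse_ntriples_line(line)
        if triple is None:
            continue
        subject, _predicate, obj = triple
        arcs.append((node_id(subject), node_id(obj)))

    out = [f"*Vertices {len(ids)}"]
    for term, number in ids.items():
        label = term.replace('"', "'")  # assumption: avoid nested double quotes
        out.append(f'{number} "{label}"')
    out.append("*Arcs")
    out.extend(f"{a} {b}" for a, b in arcs)
    return "\n".join(out) + "\n"


example = [
    "<http://ex.org/a> <http://ex.org/p> <http://ex.org/b> .",
    "# a comment line",
    '<http://ex.org/b> <http://ex.org/p> "hello world" .',
]
net_text = ntriples_to_pajek(example)
```

The predicate is discarded in this sketch, so two triples linking the same pair of nodes through different properties produce two identical arcs.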